ABSTRACT

We  report  on  the  University  of  Lugano’s  participation  in 
the  Blog  track  of  TREC  2009.  In  particular  we  describe  our 
system  for  performing  blog  distillation,  faceted  search  and 
top  stories  identification. 

1.  INTRODUCTION 

User-generated data is growing rapidly and is becoming one
of the most important sources of information on the web.
This data contains a wealth of information, such as opinions
and experiences, which can be useful in many applications.
Forums, mailing lists, on-line discussions, community
question answering sites and social networks like Facebook
are some of the data sources that have recently attracted
the attention of researchers.

The blogosphere (the collection of blogs on the web) is one
of the main sources of information in this category.
Millions of people write about their experiences and
opinions in their blogs every day, which provides a huge
amount of information to be processed. Due to the importance
of this information, TREC (Text REtrieval Conference)
started a track for blog analysis that includes opinion
detection, polarity mining and blog distillation [7, 11].

In the remainder of this paper we explain our approach to
faceted blog distillation in Section 2 and our approach to
top stories identification in Section 3. We provide
conclusions in Section 4.

2.  BLOG  DISTILLATION

Blog distillation is the problem of retrieving blogs (each
viewed as a collection of posts) that are relevant to a
given query. The blog distillation task has been approached
from many different points of view. In [3], the authors view
it as ad-hoc search and consider each blog as a long
document created by concatenating all of its postings. Other
researchers treat it as the resource ranking problem in
federated search [4]. They view the blog search problem as
the task of ranking collections of blog posts rather than
single documents. A similar approach is used in [12], where
a blog is again considered as a collection of postings and
resource selection approaches are applied. The intuition is
that finding relevant blogs is similar to finding relevant
collections in a distributed search environment. In [8], the
authors model blog distillation as an expert search problem
and use a voting model to tackle it.

2.1  Ordered  Weighted  Averaging  Operators 
in  Combining  Scores 

The ordered weighted averaging operator, commonly called the
OWA operator, was introduced by Yager [13]. OWA provides a
parametrized class of mean-type aggregation operators that
can generate the OR operator (Max), the AND operator (Min)
and any other aggregation operator between them.

An OWA operator of dimension $n$ is a mapping
$F : \mathbb{R}^n \to \mathbb{R}$ that has an associated
weighting vector $W$,

$$W = [w_1, w_2, \ldots, w_n]^T$$

such that

$$\sum_{i=1}^{n} w_i = 1, \qquad 0 \le w_i \le 1,$$

and where

$$F(a_1, \ldots, a_n) = \sum_{i=1}^{n} w_i b_i \qquad (1)$$

where $b_i$ is the $i$th largest element in the collection
$a_1, \ldots, a_n$. There are different methods for
determining the weighting vector $W$. We use a
quantifier-based method introduced by Yager [13].

The OWA operator behaves differently depending on the
weighting vector associated with it. Yager introduced two
measures for characterizing an OWA operator [13]. The first
one is called orness and is defined as:

$$\mathrm{orness}(W) = \frac{1}{n-1} \sum_{i=1}^{n} (n - i)\, w_i \qquad (2)$$

$$\mathrm{orness}(W) \in [0, 1]$$

which characterizes the degree to which the operator behaves
like an OR operator. The second measure is dispersion and is
defined as

$$\mathrm{dispersion}(W) = -\sum_{i=1}^{n} w_i \ln(w_i) \qquad (3)$$

and measures the degree to which the OWA operator takes into
account all information in the aggregation.

To apply the OWA operator to our problem, one important
issue is determining the weighting vector. Yager introduced
a method based on linguistic quantifiers for obtaining these
weights:

$$w_i = Q\left(\frac{i}{n}\right) - Q\left(\frac{i-1}{n}\right), \qquad i = 1, 2, \ldots, n \qquad (4)$$

where $n$ is the number of operands to be combined, and $Q$
is the fuzzy linguistic quantifier. We use the following
definition for the $Q$ function, as suggested by Zadeh [15]:

$$Q(r) = \begin{cases} 0, & \text{if } r < a \\ \dfrac{r - a}{b - a}, & \text{if } a \le r \le b \\ 1, & \text{if } r > b \end{cases} \qquad (5)$$

with $a, b, r \in [0, 1]$. We used the parameter pair
$(a, b)$ with three different values, (0, 0.5), (0.3, 0.8)
and (0.5, 1), as three quantifiers with different levels of
orness. Table 1 shows orness and dispersion for each
quantifier with values of 5, 10, 20 for $n$. In this model,
$n$ is the number of most relevant posts in each blog whose
relevance scores we aggregate. These relevance scores are
calculated for the posts in each blog by the BM25 model in
Terrier. Figure 1 and Figure 2 show Mean Average Precision
(MAP) and Precision at 10 for experiments over the TREC07
dataset. These results suggest that a fixed number of highly
relevant posts in each blog is reliable evidence on which an
effective blog retrieval system can be built.
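As an illustration of equations (1) to (5), the following minimal Python sketch derives the OWA weights from the quantifier parameters $(a, b)$ and computes orness, dispersion and the aggregated score. The function names are our own choices for this sketch, not part of any library, and terms with $w_i = 0$ are assumed to contribute zero to the dispersion.

```python
import math


def quantifier(r, a, b):
    """Piecewise-linear fuzzy quantifier Q(r) of equation (5)."""
    if r <= a:
        return 0.0
    if r >= b:
        return 1.0
    return (r - a) / (b - a)


def owa_weights(n, a, b):
    """Weights w_i = Q(i/n) - Q((i-1)/n) for i = 1..n (equation 4)."""
    return [quantifier(i / n, a, b) - quantifier((i - 1) / n, a, b)
            for i in range(1, n + 1)]


def owa(scores, weights):
    """Equation (1): weighted sum of the scores sorted in descending order."""
    ordered = sorted(scores, reverse=True)
    return sum(w * b for w, b in zip(weights, ordered))


def orness(weights):
    """Equation (2)."""
    n = len(weights)
    return sum((n - i) * w for i, w in enumerate(weights, start=1)) / (n - 1)


def dispersion(weights):
    """Equation (3); zero weights are skipped (0 * ln 0 is taken as 0)."""
    return -sum(w * math.log(w) for w in weights if w > 0)
```

For $n = 10$ the three quantifiers give orness values of about 0.78, 0.44 and 0.22 and a dispersion of about 1.609, close to the values listed in Table 1. A blog score would then be obtained as `owa(top_post_scores, owa_weights(n, a, b))`, where `top_post_scores` is a hypothetical list holding the $n$ highest post scores of the blog.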

2.2  Regularizing  Relevance  Scores 

Score regularization is a way of re-calibrating relevance
scores for documents based on the relationships between
them. The idea behind score regularization is that, in
accordance with the Clustering Hypothesis, related documents
should have similar scores for the same query. The authors
of [2, 10] propose general models for smoothing document
scores based on this hypothesis. In [2], Diaz models the
problem in terms of optimization. The goal is to calculate
for each document a new (smoothed) score with two contending
objectives: score consistency with related documents and
score consistency with the initial retrieval score. Diaz
defines a cost function $C(f)$ as follows:

$$C(f) = \sigma(f) + \mu\,\varepsilon(f) = \sum_{i,j} \left( w_{ij} f_i - w_{ji} f_j \right)^2 + \mu \sum_{i} \left( f_i - y_i \right)^2 \qquad (6)$$

Here $f$ is a vector of regularized scores over $n$
documents, and $\sigma(f)$ is a cost function associated
with the inter-document consistency of the scores; if
related documents have inconsistent scores, the value of
this function will be high. A second cost function
$\varepsilon(f)$ measures the consistency with the original
scores; if document scores are inconsistent with the
original scores, the value of this function will be high. A
regularization parameter $\mu$ controls the trade-off
between inter-document smoothing and consistency with the
original score vector $y$. The coefficient $w_{ij}$ in the
expansion of $\sigma(f)$ weights the score of the $i$th
document by its similarity to the $j$th document and is
calculated by normalizing (and taking the square root of)
values from a symmetric affinity matrix $W$ as follows:

$$w_{ij} = \sqrt{\frac{W_{ij}}{\sum_{k} W_{ik}}} \qquad (7)$$

Here $W_{ij}$ denotes the similarity between documents $i$
and $j$. In order to keep the affinity matrix sparse, only
the $k$ most similar documents $j$ for each document $i$
have non-zero $W_{ij}$ values¹. The diagonal values in the
matrix $W$ are defined to be zero. An iterative solution for
the optimization problem defined above is the following:

$$f^{t+1} = (1 - \alpha)\, y + \alpha \hat{W} f^{t} \qquad (8)$$

where $\alpha = \frac{1}{1 + \mu}$ is a parameter,
$y = f^{0}$ is the initial score vector, $f^{t}$ is the
score vector after $t$ iterations and $\hat{W}$ is a
normalized affinity matrix such that
$\hat{W}_{ij} = w_{ij} w_{ji}$. The closed form solution of
this problem is given by:

$$f^{*} = (I - \alpha \hat{W})^{-1} y \qquad (9)$$

We used this equation in our experiments. We note that we
did not introduce a new model here, but simply investigated
the application of graph-based regularization frameworks
[2, 10] to the problem of blog distillation, where the aim
is not just to rank documents, but to rank blogs which are
themselves composed of many documents (posts).

Table 2: Regularization Results for TREC07 and TREC08 query sets.

Model                 MAP      P@10     nDCG     Bpref
TREC07 query sets     0.3126   0.4956   0.5483   0.3118
TREC08 query sets     0.2375   0.3480   0.6990   0.2196

Based on this method we regularize relevance scores, which
can be either the scores of the posts or the score of the
blog as a whole. When regularizing post relevance scores we
have to aggregate the regularized scores again, for which we
use simple averaging. When regularizing the blog score as a
whole, we generate one document per blog, which is the
concatenation of its most relevant posts. We use the
similarity score of this large document as the blog
relevance score and use it in the regularization. Table 2
shows the results of post relevance score regularization
over the Blog06 dataset with the TREC07 and TREC08 query
sets.
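The iterative update of equation (8) can be sketched in a few lines of plain Python. This is only an illustration under our own assumptions: the affinity matrix is given as a dense list of lists, rows without any neighbour are left unsmoothed, and the function names are hypothetical. Note that the fixed point of the iteration equals the closed form of equation (9) up to the constant factor $(1 - \alpha)$, which does not affect the ranking.

```python
import math


def normalize_affinity(W):
    """Symmetric normalization: W_hat[i][j] = W[i][j] / sqrt(D_i * D_j),
    where D_i is the sum of row i of the affinity matrix W."""
    n = len(W)
    degree = [sum(row) for row in W]
    return [[W[i][j] / math.sqrt(degree[i] * degree[j])
             if degree[i] > 0 and degree[j] > 0 else 0.0
             for j in range(n)]
            for i in range(n)]


def regularize(y, W, alpha=0.5, iterations=100):
    """Iterate f <- (1 - alpha) * y + alpha * W_hat * f, starting from f = y."""
    W_hat = normalize_affinity(W)
    n = len(y)
    f = list(y)
    for _ in range(iterations):
        f = [(1 - alpha) * y[i]
             + alpha * sum(W_hat[i][j] * f[j] for j in range(n))
             for i in range(n)]
    return f
```

With two mutually similar documents scored 1.0 and 0.0 and one isolated document, the second document is pulled up towards its neighbour while the isolated one is only rescaled by $(1 - \alpha)$.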

2.3  Faceted  Search 

For the faceted rankings, we first generated positive and
negative facet scores for each retrieved document, denoted
$pos(d)$ and $neg(d)$ respectively. These facet scores
induce a ranking, denoted $r_{pos}(d, q)$ and
$r_{neg}(d, q)$, which we combined with the original
relevance ranking $r_{rel}(d, q)$ using the Borda Fuse
aggregation method as follows:²

$$score_{BF}(d, q) = \alpha\, r_{rel}(d, q) + (1 - \alpha)\, r_{facet}(d, q) \qquad (10)$$

Without any training data (i.e. relevance judgments) we
were unable to choose an appropriate value for the weighting
coefficient $\alpha$ and thus set its value to 0.5.

¹ Some documents may need to have more than $k$ non-zero
affinity values in order to keep the matrix symmetric.

² Note that whenever there are ties in the ranking (i.e.
documents $d_1$ and $d_2$ have the same score), the rank for
those documents is the average of the (total order) ranking.
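The rank fusion of equation (10), including the tie handling described in the footnote, might be sketched as follows. The conversion of ranks into Borda points (number of documents minus rank, so that higher is better) is our assumption, as are the function names; both score dictionaries are assumed to cover the same documents.

```python
def average_ranks(scores):
    """Map each document to its rank (1 = best); documents with tied
    scores all receive the average of the positions they occupy."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        avg = (i + 1 + j + 1) / 2  # average of 1-based positions i+1 .. j+1
        for doc in ordered[i:j + 1]:
            ranks[doc] = avg
        i = j + 1
    return ranks


def borda_fuse(rel_scores, facet_scores, alpha=0.5):
    """Combine relevance and facet rankings as in equation (10)."""
    n = len(rel_scores)
    r_rel = average_ranks(rel_scores)
    r_facet = average_ranks(facet_scores)
    # Borda points: n - rank, so a larger fused score means a better document.
    return {d: alpha * (n - r_rel[d]) + (1 - alpha) * (n - r_facet[d])
            for d in rel_scores}
```

Passing `pos(d)` scores as `facet_scores` yields the ranking for the positive facet value, and passing `neg(d)` scores yields the ranking for the negative one.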


Table 1: Orness and dispersion for experimented quantifiers in OWA operator

parameters        orness (n=10)   dispersion (n=10)   linguistic quantifier
a=0.0, b=0.5      0.77            1.609               At least half
a=0.3, b=0.8      0.44            1.609               Most
a=0.5, b=1.0      0.22            1.609               As many as possible

Figure  1:  Mean  Average  Precision 


Figure  2:  Precision  at  10 


2.3.1  In-depth  versus  Shallow 

For the in-depth versus shallow facet, we calculated the
Cross Entropy (CE) between each retrieved document and the
collection as a whole. We used CE as the score for the
positive (in-depth) facet value, since high CE indicates
that the document contains many rare and informative words:

$$pos(d) = CE(p(\cdot|d), p(\cdot|c)) = \sum_{t \in d} p(t|d) \log \frac{1}{p(t|c)} \qquad (11)$$

Here $p(t|d)$ is the probability of a term $t$ appearing
within the document $d$, which we calculate using the
relative term frequency as follows:
$p(t|d) = tf(t, d) / \sum_{t'} tf(t', d)$, where $tf(t, d)$
is the absolute term frequency. Meanwhile $p(t|c)$ denotes
the probability of a term across the whole collection $c$,
for which we use a document-frequency-based estimate
$p(t|c) = df(t)/|c|$, where $|c|$ is the number of documents
in the collection. Our rationale for using a $df$-based
rather than a $tf$-based estimate is that the former appears
less susceptible to noise from spam documents, which often
include terms with very high frequency (high $tf$ values).

For the negative (shallow) facet score we simply use the
negation of the CE, i.e. $neg(d) = -pos(d)$.
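A minimal sketch of equation (11) is given below. It assumes that the document has already been tokenized and that a document-frequency table for the collection is available; `df` and `num_docs` are hypothetical inputs, and every term of the document is assumed to occur in at least one collection document.

```python
import math
from collections import Counter


def in_depth_score(doc_tokens, df, num_docs):
    """Cross entropy between the document and collection term distributions.

    doc_tokens: list of terms in the document
    df:         dict mapping term -> number of collection documents containing it
    num_docs:   |c|, the number of documents in the collection
    """
    tf = Counter(doc_tokens)
    total = sum(tf.values())
    score = 0.0
    for term, count in tf.items():
        p_t_d = count / total          # relative term frequency p(t|d)
        p_t_c = df[term] / num_docs    # document-frequency estimate p(t|c)
        score += p_t_d * math.log(1.0 / p_t_c)
    return score


def shallow_score(doc_tokens, df, num_docs):
    """Negative facet score: neg(d) = -pos(d)."""
    return -in_depth_score(doc_tokens, df, num_docs)
```

A document made up only of terms that occur in every collection document scores 0, whereas each rare term adds a contribution proportional to its share of the document.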


2.3.2  Opinion  versus  Factual 
For the opinion versus factual facet, we built lexicons of
opinionated and objective words using the TREC Blog06
collection and the corresponding relevance/opinion
judgments. In the lexicon, terms were weighted according to
a document-frequency-based version of the Mutual Information
(MI) metric [9]. We then calculated (positive and negative)
facet scores for each retrieved document by averaging over
the lexicon weights for each word in the document (see
equation 14 below).

In order to calculate both positive (opinionated) and
negative (factual) facet weights for terms, we split the
Mutual Information metric into two values as follows. Let
$T$ denote the event that a document contains the particular
term $t$, and $\bar{T}$ the event that the document does not
contain the term. Then let $O$ denote the event that a
document is classed as being (relevant and) opinionated
about the query and $\bar{O}$ that it is (relevant but) not
opinionated about the query. We calculate the positive facet
score for a term by calculating the MI summation only over
the two positively correlated quadrants (i.e. $T \cap O$ and
$\bar{T} \cap \bar{O}$) as follows:

$$pos(t) = p(T, O) \log \frac{p(T, O)}{p(T)\, p(O)} + p(\bar{T}, \bar{O}) \log \frac{p(\bar{T}, \bar{O})}{p(\bar{T})\, p(\bar{O})} \qquad (12)$$

The negative facet score is calculated analogously as follows:

$$neg(t) = p(T, \bar{O}) \log \frac{p(T, \bar{O})}{p(T)\, p(\bar{O})} + p(\bar{T}, O) \log \frac{p(\bar{T}, O)}{p(\bar{T})\, p(O)} \qquad (13)$$

We calculate the required joint and marginal probabilities
from document frequency estimates, using the sets of
opinionated ($O$) and relevant ($R$) documents in the TREC
Blog06 collection, as:

$$p(T, O) = df(t, O)/|R|$$
$$p(T) = df(t, R)/|R|$$
$$p(O) = |O|/|R|$$

where $df(t, O)$ is the number of opinionated documents
containing the term $t$. The other joint and marginal
probabilities required for equations 12 and 13 are estimated
analogously.

Having calculated positive and negative weights for each
term, we then averaged these lexicon weights over each
document to calculate positive and negative facet scores for
the document as follows:

$$pos(d) = E_d[pos(t)] = \sum_{t \in d} p(t|d)\, pos(t) \qquad (14)$$
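Equations (12) to (14) can be illustrated with the sketch below. It works directly on document counts and assumes that every opinionated document is also relevant ($O \subseteq R$), that empty quadrants contribute zero, and that terms missing from the lexicon have weight zero; all names are hypothetical.

```python
import math
from collections import Counter


def _quadrant(n_joint, n_a, n_b, n_total):
    """One quadrant of the MI sum, computed from counts."""
    if n_joint == 0:
        return 0.0  # assumption: an empty quadrant contributes nothing
    p_joint = n_joint / n_total
    return p_joint * math.log(p_joint / ((n_a / n_total) * (n_b / n_total)))


def facet_weights(df_t_O, df_t_R, n_O, n_R):
    """Return (pos(t), neg(t)) for a term.

    df_t_O: opinionated documents containing the term
    df_t_R: relevant documents containing the term
    n_O:    number of opinionated documents, n_R: number of relevant documents
    """
    n_T, n_notT = df_t_R, n_R - df_t_R
    n_notO = n_R - n_O
    n_T_notO = df_t_R - df_t_O        # term present, not opinionated
    n_notT_O = n_O - df_t_O           # term absent, opinionated
    n_notT_notO = n_notO - n_T_notO   # term absent, not opinionated
    pos = (_quadrant(df_t_O, n_T, n_O, n_R)
           + _quadrant(n_notT_notO, n_notT, n_notO, n_R))
    neg = (_quadrant(n_T_notO, n_T, n_notO, n_R)
           + _quadrant(n_notT_O, n_notT, n_O, n_R))
    return pos, neg


def document_facet_score(doc_tokens, lexicon):
    """Equation (14): expectation of the term weights under p(t|d)."""
    tf = Counter(doc_tokens)
    total = sum(tf.values())
    return sum((count / total) * lexicon.get(term, 0.0)
               for term, count in tf.items())
```

A term that occurs mostly in opinionated documents receives a positive `pos` weight and a negative `neg` weight, while a term that is statistically independent of the opinion label receives zero for both.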

2.3.3  Personal  versus  Official 

Finally  for  the  personal  versus  official  facet,  the  same 
scores  were  used  as  in  the  opinion  case,  since  we  believe 
that  more  “personal  content”  is  on  the  whole  more  likely  to 
contain  opinions  than  more  “official  content”. 

3.  TOP  STORIES  IDENTIFICATION 

Our method for the top stories task proceeded as follows.
We first extracted time-stamped news stories for each query
date while filtering out non-news related items. For each
query date we also extracted the set of blog posts that were
posted on the same or following days and that had some
vocabulary overlap with the corresponding set of news
stories. Each set of blog posts was then clustered using an
incremental clustering algorithm. Next we ranked clusters
with respect to size and time-span in order to identify the
most important clusters pertaining to the corresponding news
stories. Finally we identified the most authoritative
document for the 10 most important clusters on each query
date.

In the following sections we outline our approach in more
detail.

3.1  The  algorithm 

In this section we present details of our algorithm. Our
method for the top stories task proceeds as follows.

1.  for every query date we extract a set of time-stamped
news stories by using the date part of the permanent
links of the URLs, and we filter out non-news related
documents

2.  for every query date we extract a set of time-stamped
blog posts that satisfy the following conditions: (I) they
were posted on the same day or during the following three
days and (II) they have a vocabulary overlap with the
corresponding time-stamped set of news stories.

3.  we cluster every time-stamped blog-post set using an
incremental clustering algorithm whose details are
presented in Section 3.2.

4.  we identify the most important clusters pertaining to
the corresponding time-stamped news stories by ranking
clusters with respect to size and time-span.

5.  we filter out clusters that correspond to the news
stories by using the centroid score.

6.  we identify the most authoritative document for the
top-10 most important clusters for every query date
using the ranking algorithm presented in Section 3.3.


Table 3: Example documents for which the triangle
inequality does not hold

doc    w1   w2   w3   w4   w5   w6
d1     1    1    0    0    0    0
d2     0    1    1    1    1    0
d3     0    0    0    0    1    1

3.2  Incremental  document  clustering  in  sliding
time-window

In this section we present our implementation of an
incremental variant of a non-hierarchical document
clustering algorithm using a similarity measure based on
nearest neighbors (NN-based) [5, 6, 1].

We preprocess every document using the following steps:
(I) HTML parsing; (II) tokenization; (III) stemming; and
(IV) stopword removal. We represent a document $d$ using the
term vector model, where $d = [w_1, w_2, \ldots, w_d]$ and
$w_i$ is the weight of the $i$-th term (word) extracted from
the document after preprocessing. We use an NN-based
similarity measure between news stories because direct
similarity measures between two vectors, such as Euclidean
distance and the dot product, have the following problems in
a high-dimensional space: there is experimental evidence
that they are not reliable [5], and the triangle inequality
does not hold. As an example where the triangle inequality
does not hold, consider the three vectors $d_1$, $d_2$ and
$d_3$ in Table 3, which represent hypothetical documents
over six terms. Although $d_1$ is close to $d_2$ by sharing
one term, and $d_2$ is close to $d_3$ by also sharing one
term, $d_1$ and $d_3$ do not share any terms. The triangle
inequality does not hold for documents for the following
reasons: (I) the same meaning with respect to the same event
can be expressed with diverse terms, which is aggravated by
the fact that we consider similarity between documents
across different news sources; and (II) the content of
stories reporting the same event may change over time and
may use a different vocabulary. The clusters containing the
documents are therefore inherently non-globular, which
justifies the use of an NN-based rather than a
centroid-based similarity measure.

We perform the standard TF-IDF weighting of the document
term vector $d$: $w_i = tf_i \cdot idf_i$, where $tf_i$ is
the within-document term frequency of term $t_i$ and
$idf_i = \log(N / df_i)$ is the inverse document frequency,
where $N$ is the total number of documents in the collection
and $df_i$ is the document frequency of term $t_i$, defined
as the number of documents in the collection containing the
term.

In order to present the clustering algorithm we introduce
the following notation. Let

$$sim(d_i, d_j) = \frac{\sum_k d_{ik}\, d_{jk}}{\|d_i\| \, \|d_j\|}$$

be the cosine similarity or content similarity between
documents $d_i$ and $d_j$, where $sim(d_i, d_j) \in [0, 1]$.
Let $\mathcal{N}_{T_d}(d_i)$ be the neighborhood of $d_i$,
defined as the set of documents $d$ for which
$sim(d_i, d) \ge T_d$. Let $C = \{C_1, C_2, \ldots, C_n\}$
be the set of active clusters in the window. Let
$C(\mathcal{N}_{T_d}(d_i)) \subseteq C$ be the set of
clusters that contain any documents in
$\mathcal{N}_{T_d}(d_i)$. Let $\mathcal{N}_{T_d}(C_j, d_i)$
be the subset of documents in $\mathcal{N}_{T_d}(d_i)$
belonging to cluster $C_j \in C$, such that
$\mathcal{N}_{T_d}(C_j, d_i) = \emptyset$ if cluster $C_j$
has no members in $\mathcal{N}_{T_d}(d_i)$. Let
$A(d_i, C) = \sum_{d \in C} sim(d_i, d)$ be the similarity
between $d_i$ and the set of documents $d \in C$. Let
$df^{(i)}$ be the document frequency vector for stream $i$.
Let currentTime be the timestamp of the most recent document
in the window, i.e. the current timestamp of the window.
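The TF-IDF weighting, the cosine similarity and the neighborhood definition can be illustrated on the three hypothetical documents of Table 3. The sketch below uses sparse dictionaries as term vectors; the function names and the example threshold are our own assumptions.

```python
import math


def tfidf_vectors(term_counts):
    """term_counts: one {term: tf} dictionary per document in the collection."""
    n_docs = len(term_counts)
    df = {}
    for counts in term_counts:
        for term in counts:
            df[term] = df.get(term, 0) + 1
    return [{t: tf * math.log(n_docs / df[t]) for t, tf in counts.items()}
            for counts in term_counts]


def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def neighborhood(i, vectors, threshold):
    """Indices of all other documents whose similarity to document i
    reaches the threshold T_d."""
    return [j for j in range(len(vectors))
            if j != i and cosine(vectors[i], vectors[j]) >= threshold]


# The three documents of Table 3 (binary term frequencies).
table3 = [
    {"w1": 1, "w2": 1},
    {"w2": 1, "w3": 1, "w4": 1, "w5": 1},
    {"w5": 1, "w6": 1},
]
vectors = tfidf_vectors(table3)
```

Here `cosine(vectors[0], vectors[1])` and `cosine(vectors[1], vectors[2])` are both positive while `cosine(vectors[0], vectors[2])` is zero, which is the chaining behaviour that motivates the NN-based measure.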

The clustering algorithm proceeds as follows:

1.  Neighborhood search: given a new document $d_i$,
identify its neighborhood $\mathcal{N}_{T_d}(d_i)$.

2.  Identification of a cluster that can accept a new
document: for every cluster
$C \in C(\mathcal{N}_{T_d}(d_i))$ compute
$A(d_i, \mathcal{N}_{T_d}(C, d_i))$. Select the cluster

$$C_{max} = \arg\max_{C \in C(\mathcal{N}_{T_d}(d_i))} A(d_i, \mathcal{N}_{T_d}(C, d_i)).$$

If $\mathcal{N}_{T_d}(d_i)$ is empty then create a new
cluster $C_{new}$ for $d_i$.

3.  Merging: merge every cluster
$C \in C(\mathcal{N}_{T_d}(d_i)) \setminus \{C_{max}\}$
with $C_{max}$.
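The three steps above can be sketched as a single function. This is an illustrative outline rather than our actual implementation: the similarity is passed in as a callable, $A$ is assumed to be the sum of similarities over the neighbours in a cluster, and the data structures and names are hypothetical.

```python
def cluster_document(doc_id, window_docs, cluster_of, clusters, sim, threshold=0.5):
    """Assign a new document to a cluster and return the cluster id.

    window_docs: ids of the documents already in the window
    cluster_of:  dict mapping document id -> cluster id
    clusters:    dict mapping cluster id -> set of document ids
    sim:         callable (doc_id, doc_id) -> similarity in [0, 1]
    """
    # Step 1: neighborhood search.
    neighbours = [d for d in window_docs if sim(doc_id, d) >= threshold]
    if not neighbours:
        # Empty neighborhood: open a new singleton cluster.
        new_id = max(clusters, default=-1) + 1
        clusters[new_id] = {doc_id}
        cluster_of[doc_id] = new_id
        return new_id

    # Step 2: total similarity to the neighbours found in each candidate cluster.
    totals = {}
    for d in neighbours:
        c = cluster_of[d]
        totals[c] = totals.get(c, 0.0) + sim(doc_id, d)
    best = max(totals, key=totals.get)

    # Step 3: merge every other candidate cluster into the best one.
    for c in totals:
        if c != best:
            for d in clusters.pop(c):
                clusters[best].add(d)
                cluster_of[d] = best
    clusters[best].add(doc_id)
    cluster_of[doc_id] = best
    return best
```

Because a new document can link two previously separate clusters, chains of pairwise-similar documents end up in one cluster even when their end points share no terms.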

To achieve an efficient neighborhood search in the window we
dynamically maintain an inverted index data structure in the
time-window. We also maintain an independent document
frequency vector $df^{(i)}$ for each stream $i$ in order to
suppress terms whose popularity is specific to a particular
news source.

The sliding window process proceeds as follows. When a new
document $d^{(i)}$ arrives, the following actions are
executed: (I) the document is added to the window, which
involves adding the corresponding terms to the inverted
index and to the $df^{(i)}$ vector; (II) $d^{(i)}$ is
clustered using the presented algorithm, and if the result
is a singleton cluster then it is added to the set of active
clusters $C$; (III) currentTime is set to
$d^{(i)}.timestamp$; (IV) documents which are older than
currentTime $- w$ are removed from the window, which
involves removing the corresponding entries in the set of
active clusters $C$ and in the inverted index.

The presented clustering algorithm thus has the following
parameters: (I) the time-window size $w = 24$ hours; (II)
the document similarity threshold $T_d = 0.5$. Our
evaluation of the clustering results suggests a precision of
95%. We selected $T_d = 0.5$ based on an experimental
evaluation that showed $T_d = 0.5$ to be a good compromise
with respect to precision and recall.
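The window maintenance described above might look as follows in a simplified form. The sketch keeps only the inverted index and the eviction logic; the per-stream document frequency vectors and the cluster bookkeeping are omitted, documents are assumed to arrive in timestamp order, and the class and method names are hypothetical.

```python
from collections import defaultdict, deque


class SlidingWindow:
    """Time window with an inverted index over the documents it holds."""

    def __init__(self, size_seconds=24 * 3600):
        self.size = size_seconds
        self.docs = deque()             # (timestamp, doc_id, terms), oldest first
        self.index = defaultdict(set)   # term -> ids of window documents

    def add(self, doc_id, timestamp, terms):
        """Add a document; return the ids of the documents that expired."""
        terms = set(terms)
        self.docs.append((timestamp, doc_id, terms))
        for term in terms:
            self.index[term].add(doc_id)
        evicted = []
        # Remove everything older than currentTime - w.
        while self.docs and self.docs[0][0] < timestamp - self.size:
            _, old_id, old_terms = self.docs.popleft()
            for term in old_terms:
                self.index[term].discard(old_id)
                if not self.index[term]:
                    del self.index[term]
            evicted.append(old_id)
        return evicted

    def candidates(self, terms):
        """Ids of window documents sharing at least one term with the query terms."""
        found = set()
        for term in terms:
            found |= self.index.get(term, set())
        return found
```

Only the documents returned by `candidates` can have a non-zero cosine similarity with a new document, so the neighborhood search needs to compute similarities for those documents only.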

3.3  Content-aware  ranking  function 

In this section we present a content-aware ranking function
that ranks with respect to the following factors:

1.  the importance of a cluster increases with its size and
decreases with its time-span (the time distance between
the first and the last document)

2.  the importance of a document in a given position (its
authority) in a time-ordered cluster is proportional to
the difference between the average combined similarity
(content similarity and temporal distance) for the
following documents and the previous documents in the
cluster.

The first factor is an extension of the first factor for the
probabilistic ranking function by prioritizing clusters that
are proximate in time. The rationale is that a large cluster
whose documents are close in time indicates a very important
event, since every source reports it within a very short
time window.

The second factor has the following motivation. It is known
that news stories discussing the same event tend to be
temporally proximate across the news streams [14]. Therefore
we use a combined similarity measure that increases with the
content similarity and decreases with the temporal distance.
Let $\Delta t(i, j)$ be the temporal distance between
documents $d_i$ and $d_j$, where
$\Delta t(i, j) = |d_i.timestamp - d_j.timestamp|$, and let
$\alpha = -\frac{\ln(dFactor)}{w}$, where $dFactor$ is the
decaying factor, i.e. the factor by which the value of the
function decays within a time interval $w$ equal to the time
window size. Then the combined similarity $w(d_i, d_j)$ can
be expressed as follows

$$w(d_i, d_j) = sim(i, j) \cdot e^{-\alpha \Delta t(i, j)} \qquad (15)$$

where $sim(i, j)$ is the content similarity. Figure 3
presents a graphical representation of the dependencies
between documents in a cluster with respect to the combined
similarity, where a directed edge from an earlier to a more
recent document has a weight equal to the combined
similarity.
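As a small numerical illustration of the combined similarity, the sketch below applies an exponential time decay to a content similarity value. The decay form, the default decay factor of 0.5 and the use of seconds as the time unit are assumptions made for this example only.

```python
import math


def decay_rate(d_factor, window):
    """alpha such that the weight decays by d_factor over one window length."""
    return -math.log(d_factor) / window


def combined_similarity(content_sim, t_i, t_j, d_factor=0.5, window=24 * 3600):
    """Content similarity damped by the temporal distance |t_i - t_j|."""
    alpha = decay_rate(d_factor, window)
    return content_sim * math.exp(-alpha * abs(t_i - t_j))
```

Two documents published at the same time keep their content similarity unchanged, while two documents published one full window apart have it multiplied by the decay factor.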


[Figure 3: directed graph over the documents of a cluster,
with edges such as $w(d_0, d_3)$ weighted by the combined
similarity. For the document in position 1 the figure gives:]

$Prev(s, d_1.stream, 1) = \{d_0\}$
$Follow(s, d_1.stream, 1) = \{d_2, d_3\}$
$In(s, d_1.stream, 1) = w(d_0, d_1)$
$Out(s, d_1.stream, 1) = \frac{w(d_1, d_2) + w(d_1, d_3)}{2}$
$authority(s, d_1.stream, 1) = \frac{w(d_1, d_2) + w(d_1, d_3)}{2} - w(d_0, d_1)$

Figure 3: Combined similarity between documents in
itemset-sequence (cluster) $s = [d_0, d_1, d_2, d_3]$.
$Prev(s, d_1.stream, 1)$ and $Follow(s, d_1.stream, 1)$ are
the sets of documents preceding and following document $d_1$
in position 1. $In(s, d_1.stream, 1)$ and
$Out(s, d_1.stream, 1)$ are the average combined similarity
for $Prev(s, d_1.stream, 1)$ and $Follow(s, d_1.stream, 1)$
respectively, and $authority(s, d_1.stream, 1)$ is the
authority of source $d_1.stream$ in position 1.


Now we define the average combined similarity with respect
to previous and following documents. Given an
itemset-sequence $s \in S$ we define the following two sets.
Let $Prev(s, i, j) = \bigcup_{0 \le l < j} \{d_l\}$ be the
set of documents that precede the document $d_j$ with
$d_j.stream = i$ in position $j$ in $s$. Let
$Follow(s, i, j) = \bigcup_{j < l < |s|} \{d_l\}$ be the set
of documents that follow the document $d_j$ with
$d_j.stream = i$ in position $j$ in $s$. Then the average
combined similarity with respect to previous documents (in
positions $j' < j$), denoted $In(s, i, j)$, can be expressed
as follows

$$In(s, i, j) = \frac{1}{|Prev(s, i, j)|} \sum_{d \in Prev(s, i, j)} w(d, d_j). \qquad (16)$$

Also the average combined similarity with respect to the
following documents (in positions $j < j'$), denoted
$Out(s, i, j)$, can be expressed as follows

$$Out(s, i, j) = \frac{1}{|Follow(s, i, j)|} \sum_{d \in Follow(s, i, j)} w(d, d_j). \qquad (17)$$

Given the values of $In(s, i, j)$ and $Out(s, i, j)$ we
define the "authority" of source $i$ corresponding to a
document in position $j$ as follows

$$authority(s, i, j) = Out(s, i, j) - In(s, i, j). \qquad (18)$$


This measure of authoritativeness prioritizes sources that:
(I) "borrow" little content from previous documents
($In(s, i, j)$) and whose content is widely "borrowed" by
following documents in the cluster ($Out(s, i, j)$) and (II)
produce timely content (for the first story in the cluster,
$j = 0$, $|Follow(s, i, j)|$ attains its maximum $|s| - 1$
and $|Prev(s, i, j)|$ its minimum 0). This measure of
authoritativeness has many desired properties. For example,
consider a case where there is a source $i_2$ which always
follows an authoritative source $i_1$ with very similar
content. Then $authority(s, i_2, j)$ will be very small
(even negative) for $i_2$, since it only "repeats" the
content of $i_1$. This case may correspond to a reuse of
content by $i_2$ from $i_1$, where $i_2$ repeats content
from $i_1$ within a short time window. In other words, (18)
discriminates between "producers" of the content (positive
value of (18)) and "repeaters" (negative value of (18)).
However, note that because of the limitations of the cosine
similarity measure we are unable to decide with one hundred
percent confidence that one story is reusing content from
another one.
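The authority of equation (18) can be computed directly from a matrix of combined similarities over a time-ordered cluster. In the sketch below the average over an empty set of previous or following documents is taken to be zero, which is our assumption; positions are zero-based and the function name is hypothetical.

```python
def authority(weights, j):
    """Authority of the document in position j of a time-ordered cluster.

    weights[a][b] is the combined similarity between the documents in
    positions a and b (a symmetric matrix).
    """
    n = len(weights)
    prev = range(0, j)          # positions of preceding documents
    follow = range(j + 1, n)    # positions of following documents
    in_avg = sum(weights[l][j] for l in prev) / len(prev) if len(prev) else 0.0
    out_avg = sum(weights[l][j] for l in follow) / len(follow) if len(follow) else 0.0
    return out_avg - in_avg
```

For a cluster in which the first document is strongly similar to both later documents, the first position obtains a positive authority, whereas a later document that mainly resembles its predecessor obtains a negative one.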

We now define the rank of a cluster $s$ as follows

$$rankCluster(s) = w_{cluster}(k) \cdot \Delta t(0, |s| - 1) \qquad (19)$$

where $w_{cluster}(k)$ is the weight of a cluster of size
$k$ and $\Delta t(0, |s| - 1)$ is the time-span of the
cluster.

Despite  the  sophisticated  clustering  machinery  used  in  the 
top  stories  identification,  our  results  were  poor  due  to  the 
fact  that  we  were  only  able  to  run  the  clustering  over  a  small 
subset  (around  10%)  of  the  data.  It  was  mainly  because  of 
the  time  restriction  and  the  computational  load  required  by 
the  algorithm  on  the  very  large  dataset. 


4.  CONCLUSIONS 

We have described our participation in the TREC 2009 Blog
track for faceted blog distillation and top stories. We
implemented two types of algorithms for blog distillation.
In one of our experiments, we used fuzzy aggregation methods
for combining post relevance scores in each blog to
calculate blog scores as a whole. In another part of the
experiments, we used regularization methods for smoothing
relevance scores based on the similarity between the
retrieved blogs. We carried out regularization on two types
of scores: post relevance scores and large document
relevance scores (where each blog is represented by the
concatenation of its most relevant posts). Finally we
combined the two methods (regularization and OWA) to take
into account the similarity between retrieved posts while
performing good aggregation over them, to generate new
scores for each blog.

For the faceted rankings, we first generated positive and
negative facet scores for each retrieved document and then
combined the facet rankings with the relevance ranking using
Borda Fuse.

For the top stories task we first extracted time-stamped
news stories for each query date while filtering out
non-news related items. For each query date we also
extracted the set of blog posts that were posted on the same
or following days and that had some vocabulary overlap with
the corresponding set of news stories. Each set of blog
posts was then clustered using an incremental clustering
algorithm. Next we ranked clusters with respect to size and
time-span in order to identify the most important clusters
pertaining to the corresponding news stories. Finally we
identified the most authoritative document for the 10 most
important clusters on each query date.